arXiv:1508.04957v1 [cs.DB] 20 Aug 2015


Cohesiveness Relationships to Empower Keyword Search on Tree Data on the Web


Aggeliki Dimitriou 
School of Electrical and 
Computer Engineering 
National Technical University 
of Athens, Greece 
angela@dblab.ntua.gr 


Ananya Dass 
Department of Computer 
Science 

New Jersey Institute of 
Technology, USA 
ad292@njit.edu 


Dimitri Theodoratos 
Department of Computer 
Science 

New Jersey Institute of 
Technology, USA 
dth@njit.edu 


ABSTRACT 

Keyword search has been for several years the most popular
technique for retrieving information over semistructured
data on the web. The reason for this unprecedented success
is well known and twofold: (1) the user does not need to
master a complex query language to specify her requests for
data, and (2) she does not need to have any knowledge of
the structure of the data sources. However, these advantages
come with two drawbacks: (1) as a result of the imprecision
of keyword queries, there is usually a huge number of
candidate results of which only very few match the user’s intent.
Unfortunately, the existing semantics are ad-hoc and they
generally fail to “guess” the user intent. (2) As the number of
keywords and the size of data grow, the existing approaches
do not scale satisfactorily.

In this paper, we focus on keyword search on tree data and
we introduce keyword queries which can express cohesiveness
relationships. Intuitively, a cohesiveness relationship
on keywords indicates that the instances of these keywords
in a query result should form a cohesive whole, where
instances of the other keywords do not interpolate. Cohesive
keyword queries also allow keyword repetition and cohesiveness
relationship nesting. Most importantly, despite their
increased expressiveness, they enjoy both advantages of plain
keyword search. We provide formal semantics for cohesive
keyword queries on tree data which rank query results on
the proximity of the keyword instances. We design a
stack-based algorithm which builds a lattice of keyword partitions
to efficiently compute keyword queries and further leverages
cohesiveness relationships to significantly reduce the
dimensionality of the lattice. We implemented our approach and
ran extensive experiments to measure the effectiveness of
keyword queries and the efficiency and scalability of our
algorithm. Our results demonstrate that our approach
outperforms previous filtering semantics and our algorithm scales
smoothly, achieving interactive response times on queries of
20 frequent keywords on large datasets.


1. INTRODUCTION

Currently, huge amounts of data are exported and exchanged
in tree-structure form [26, 28]. Tree structures can
conveniently represent data that are semistructured, incomplete
and irregular, as is usually the case with data on the web.
Keyword search has been by far the most popular technique
for retrieving data on the web. There are two main reasons
for this popularity: (a) the user does not need to master
a complex query language (e.g., XQuery), and (b) the
user does not need to have knowledge of the schema of the
data sources over which the query is evaluated. In fact, in
most cases the user does not even know which these data
sources are. In addition, the same keyword query can be used to
extract data from multiple sources with different structure
and schema which relate the keywords in different ways.


There is a price to pay for the simplicity and convenience of
keyword search. Keyword queries are imprecise. As a
consequence, there is usually a large number of results of which
very few are relevant to the user intent. This leads to at least
two drawbacks: (a) because of the large number of candidate
results, previous algorithms for keyword search are of
high complexity and they cannot scale satisfactorily when
the number of keywords and the size of the input dataset
increase, and (b) correctly identifying the relevant results
among the plethora of candidate results is a very difficult
task. Indeed, it is practically impossible for a search system
to “guess” the user intent from a keyword query and the
structure of the data source. Multiple approaches assign
semantics to keyword queries by exploiting structural and
semantic features of the data in order to automatically filter
out irrelevant results. Examples include ELCA (Exclusive
LCA) [13, 33, 34], VLCA (Valuable LCA) [9, 11], CVLCA
(Compact Valuable LCA) [11], SLCA (Smallest LCA)
[15, 32, 30, 8], MaxMatch [24], MLCA (Meaningful LCA)
[21, 31], and Schema level SLCA [12]. A survey of some of
these approaches can be found in [25]. Although filtering
approaches are intuitively reasonable, they are sufficiently
ad-hoc and they are frequently violated in practice, resulting
in low precision and/or recall. A better technique is adopted
by other approaches which rank the candidate results in
descending order of their estimated relevance. Given that users
are typically interested in a small number of query results,
some of these approaches combine ranking with top-k
algorithms for keyword search. The ranking is performed using
a scoring function and is usually based on IR-style metrics
for flat documents (e.g., tf*idf or PageRank) adapted to the
tree-structure form of the data [25, 27]. Nevertheless, scoring
functions, keyword occurrence statistics and probabilistic
models alone cannot capture effectively the intent of the user.
As a consequence, the produced rankings are, in general, of
low quality [31].


Our approach. In this paper, we introduce a novel approach
which allows the user to specify cohesiveness relationships
among keywords in a query, an option not offered
by the current keyword query languages. Cohesiveness
relationships group together keywords in a query to define
indivisible (cohesive) collections of keywords. They partially
relieve the system from guessing without affecting the user
who is naturally tempted to specify such relationships.


For instance, consider the keyword query {XML John Smith
George Brown} to be issued against a large bibliographic
database. The user is looking for publications on XML
related to the authors John Smith and George Brown. If the
user can express the fact that the matches of John and Smith
should form a unit into which the matches of the other keywords
of the query (George, Brown and XML) cannot slip (that
is, the matches of John and Smith form a cohesive whole),
the system would be able to return more accurate results.
For example, it will be able to filter out publications on XML
by John Brown and George Smith. It will also filter out a
publication which cites a paper authored by John Davis, a
report authored by George Brown and a book on XML
authored by Tom Smith. This information is irrelevant and
none of the previous filtering approaches (e.g., ELCA, VLCA,
CVLCA, SLCA, MLCA, MaxMatch etc.) is able to automatically
exclude it from the answer of the query. Furthermore,
specifying cohesiveness relationships prevents wasting precious
time searching for these irrelevant results. We specify
cohesiveness relationships among keywords by enclosing
them between parentheses. For example, the previous
keyword query would be expressed as (XML (John Smith)
(George Brown)).


Note that specifying a cohesiveness relationship on a set of
keywords is different from phrase matching over flat text
documents in IR. For instance, Google allows a user to specify
between quotes a phrase which has to be matched intact
against the text document. In contrast, specifying a
cohesiveness relationship on a number of keywords to be evaluated
over a data tree does not impose any syntactic restriction
(e.g., distance restriction) on the matches of these
keywords on the data tree. It only requires that the matches
of these keywords form a cohesive unit. We provide formal
semantics for queries with cohesiveness relationships on tree
data in Section 2.


Cohesiveness relationships can be nested. For instance, the
query (XML (John Smith) (citation (George Brown))) looks
for a paper on XML by John Smith which cites a paper by
George Brown. The cohesive keyword query language also
conveniently allows for keyword repetition. For instance, the
query (XML (John Smith) (citation (John Brown))) looks
for a paper on XML by John Smith which cites a paper by
John Brown.


Most importantly, despite its increased expressive power, the
cohesive keyword query language enjoys both advantages
of traditional keyword search. Keyword queries with
cohesiveness relationships do not require any previous knowledge
of a language or of the schema of the data sources and they
do not require any effort by the user. The users are naturally
tempted to express cohesiveness relationships when writing
a keyword query and they can do it effortlessly with a
cohesive query language. The benefits, though, in query answer
quality and performance compared to other flat approaches
are impressive.


Contribution. The main contributions of our paper are as 
follows: 

• We formally introduce a novel keyword query language
which allows for cohesiveness relationships, cohesiveness
relationship nesting and keyword repetition.

• We provide ranking semantics for the cohesive keyword
queries on tree data based on the concept of LCA size
[11, 10]. The LCA size reflects the proximity of keywords in
the data tree and, similarly to keyword proximity in IR,
it is used to determine the relevance of the query results.

• Our semantics interpret the subtrees rooted at the LCA
of the instances of cohesively related keywords in the data
tree as black boxes where the instances of the other
keywords cannot interpolate.

• We design an efficient multi-stack based algorithm which
exploits a lattice of stacks, each stack corresponding to a
different partition of the query keywords. Our algorithm
does not rely on auxiliary index structures and, therefore,
can be exploited on datasets which have not been
preprocessed.

• We show how cohesiveness relationships can be leveraged to
lower the dimensionality of the lattice, dramatically
reduce its size and improve the performance of the
algorithm.

• We analyze our algorithm and show that for a constant
number of keywords it is linear in the size of the input
keywords’ inverted lists, i.e., in the dataset size. Our
analysis further shows that the performance of our algorithm
essentially depends on the maximum cardinality of
the largest cohesive term in the keyword query.

• We run extensive experiments on different real and
benchmark datasets to assess the effectiveness of our approach
and the efficiency and scalability of our algorithm.
Our results show that our approach largely outperforms
previous filtering approaches, achieving in most cases perfect
precision and recall. They also show that our algorithm
scales smoothly when the number of keywords and
the size of the dataset increase, achieving interactive
response times even with queries of 20 keywords having in
total several thousands of instances on large datasets.


2. DATA AND QUERY MODEL 

We consider data modeled as an ordered labeled tree. Tree
nodes can represent XML elements or attributes. Every
node has an id, a label (corresponding to an element tag
or attribute name) and possibly a value (corresponding to
the text content of an element or to an attribute’s value).
For identifying tree nodes we adopt the Dewey encoding
scheme [7], which encodes tree nodes according to a preorder
traversal of the data tree. The Dewey encoding scheme
naturally expresses ancestor-descendant and parent-child
relationships among tree nodes and conveniently supports the
processing of nodes in stacks [13].
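To make the role of Dewey codes concrete, the following minimal Python sketch shows how ancestor tests and LCA computation reduce to prefix operations. It assumes a Dewey code is represented as a tuple of integers; the function names (parse_dewey, is_ancestor, is_parent, lca) are illustrative and are not taken from the paper's implementation.

```python
from typing import Sequence, Tuple

# Assumption: a Dewey code such as "1.2.3" is stored as the tuple (1, 2, 3).
Dewey = Tuple[int, ...]


def parse_dewey(code: str) -> Dewey:
    """Turn a dotted Dewey string into a tuple of integer steps."""
    return tuple(int(step) for step in code.split("."))


def is_ancestor(a: Dewey, b: Dewey) -> bool:
    """True if node a is a proper ancestor of node b (a is a proper prefix of b)."""
    return len(a) < len(b) and b[:len(a)] == a


def is_parent(a: Dewey, b: Dewey) -> bool:
    """True if node a is the parent of node b (a is b without its last step)."""
    return len(a) + 1 == len(b) and b[:len(a)] == a


def lca(codes: Sequence[Dewey]) -> Dewey:
    """Lowest common ancestor of the given nodes: their longest common prefix."""
    prefix = codes[0]
    for code in codes[1:]:
        n = 0
        while n < min(len(prefix), len(code)) and prefix[n] == code[n]:
            n += 1
        prefix = prefix[:n]
    return prefix
```

Because a node's code extends its parent's code by one step, a stack of entries indexed by a Dewey prefix can follow a preorder scan by pushing and popping single steps.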

A keyword k may appear in the label or in the value of a
node n in the data tree one or multiple times, in which case
we say that node n constitutes an instance of k. A node may
contain multiple distinct keywords in its value and label, in
which case it is an instance of multiple keywords.

A cohesive keyword query is a keyword query, which besides
keywords may also contain groups of keywords called terms.
Intuitively, a term expresses a cohesiveness relationship on
keywords and/or terms. More formally, a keyword query is
recursively defined as follows:


Definition 1 (Cohesive keyword query). A term is
a multiset of at least two keywords and/or terms. A cohesive
keyword query is: (a) a set of a single keyword, or
(b) a term. Sets and multisets are delimited within a query
using parentheses.


For instance, the expression ((title XML) ((John Smith)
author)) is a keyword query. Some of its terms are $T_1$ =
(title XML), $T_2$ = ((John Smith) author), $T_3$ = (John
Smith), and $T_3$ is nested into $T_2$.

A keyword may occur multiple times in a query. For instance,
in the keyword query (journal (Information Systems)
((Information Retrieval) Smith)) the keyword Information
occurs twice, once in the term (Information
Systems) and once in the term (Information Retrieval).

In the rest of the paper, we may refer to a cohesive keyword
query simply as query. The syntax of a query Q is defined
by the following grammar where the non-terminal symbol T
denotes a term and the terminal symbol k denotes a keyword:

Q → (k) | T

T → (S S)

S → S S | T | k
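The grammar above can be illustrated with a small recursive-descent parser. This is only a sketch under stated assumptions: keywords are whitespace-separated tokens without parentheses, a term is returned as a nested Python list, and the name parse_query is hypothetical rather than part of the paper's system.

```python
import re
from typing import List, Union

# Assumption: a term is modeled as a list whose members are keywords (str)
# or nested terms (list).
Term = List[Union[str, list]]


def parse_query(text: str) -> Term:
    """Parse a parenthesized cohesive keyword query into nested lists."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def parse_group() -> Term:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        items: Term = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                group = parse_group()
                # Rule T -> (S S): a term has at least two members.
                if len(group) < 2:
                    raise ValueError("a term needs at least two members")
                items.append(group)
            else:
                items.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ')'
        return items

    query = parse_group()
    if pos != len(tokens):
        raise ValueError("unexpected input after the query")
    # Rule Q -> (k) | T: a one-member query must be a single keyword.
    if not query or (len(query) == 1 and not isinstance(query[0], str)):
        raise ValueError("a query is a single keyword or a term")
    return query
```

For example, parse_query("((title XML) ((John Smith) author))") yields [["title", "XML"], [["John", "Smith"], "author"]], mirroring the nesting of the terms in the query.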


We now move to define the semantics of cohesive keyword
queries. Keyword queries are embedded into data trees. In
order to define query answers, we need to introduce the concept
of query embedding. In cohesive keyword queries, m
occurrences of the same keyword in a query are embedded
to one or multiple instances of this keyword as long as these
instances collectively contain this keyword at least m times.
Cohesive keyword queries may also contain terms, which,
as mentioned before, express a cohesiveness relationship on
their keyword occurrences. In tree structured data, the keyword
instances in the data tree (which are nodes) are represented
by their LCA [29, 13, 25]. The instances of the
keywords of a term in the data tree should be indivisible.
That is, from the point of view of the instance of a keyword
which is external to a term, the subtree rooted at the LCA
of the instances of the keywords which are internal to the
term is a black box. Therefore, for a given query embedding,
the LCA of the instance of a keyword k which is not in a
term T and the instance of a keyword in T should be the
same as the LCA of the instance of k and the LCA l of the
instances of all the keywords in T, and different from l.

As an example, consider query $Q_1$ = (XML keyword search
(Paul Cooper) (Mary Davis)) issued against the data tree
$D_1$ of Figure 1. In Figure 1, keyword instances are shown
in bold and the instances of the keywords of a term below
the same article are encircled and shaded. The mapping
that assigns Paul to node 8, Mary and Cooper to node 9 and
Davis to node 10 is not an embedding since, from the point
of view of Mary, the subtree rooted at article node 6, which
is the LCA of author nodes 8 and 9 (the instances of Paul and
Cooper, respectively), is not a black box: the instance of
Mary (author node 9) is part of this subtree. These ideas are
formalized below.


Figure 2: Lattice of keyword partitions for the query (XML Query John Smith)

Figure 3: Lattices of keyword partitions for the query keywords XML, Query, John and Smith with different cohesiveness
relationships


Definition 2 (Query embedding). Let Q be a keyword
query on a data tree D. An embedding of Q to D
is a function e from every keyword occurrence in Q to an
instance of this keyword in D such that:

a. if $k_1, \ldots, k_m$ are distinct occurrences of the same keyword
k in Q and $e(k_1) = \ldots = e(k_m) = n$, then node n contains
keyword k at least m times.

b. if $k_1, \ldots, k_n$ are the keyword occurrences of a term T, k is
a keyword occurrence not in T, and $l = lca(e(k_1), \ldots, e(k_n))$,
then: (i) $e(k_1) = \ldots = e(k_n)$, or (ii) $lca(e(k), l) \neq l$,
$\forall i \in [1, n]$.
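Condition (b) of Definition 2 can be checked mechanically once nodes carry Dewey codes. The sketch below is a hedged illustration, not the paper's code: it assumes an embedding is a dictionary from keyword occurrences to Dewey tuples, that a term is given as the set of its keyword occurrences, and that repeated keywords would be given distinct occurrence names. The Dewey codes in the usage note are made up and do not correspond to Figure 1.

```python
from typing import Dict, Sequence, Set, Tuple

Dewey = Tuple[int, ...]


def lca(codes: Sequence[Dewey]) -> Dewey:
    """Longest common prefix of the given Dewey codes."""
    prefix = codes[0]
    for code in codes[1:]:
        n = 0
        while n < min(len(prefix), len(code)) and prefix[n] == code[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def respects_term(embedding: Dict[str, Dewey], term: Set[str]) -> bool:
    """Check condition (b) of Definition 2 for one term of the query."""
    inside = [embedding[k] for k in term]
    # Case (i): all keyword occurrences of the term map to the same node.
    if len(set(inside)) == 1:
        return True
    # Case (ii): no external keyword instance lies in the subtree rooted at l.
    l = lca(inside)
    for occurrence, node in embedding.items():
        if occurrence in term:
            continue
        if lca([node, l]) == l:
            return False
    return True
```

With hypothetical codes where an article is (1, 2) and its authors are (1, 2, 1), (1, 2, 2) and (1, 2, 3), mapping Paul to (1, 2, 1), Cooper and Mary to (1, 2, 2) and Davis to (1, 2, 3) violates the term (Paul Cooper), because the instance of Mary falls inside the subtree rooted at the LCA of Paul and Cooper.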

Given an embedding e of a query Q involving the keyword
occurrences $k_1, \ldots, k_m$ on a data tree D, the minimum
connecting tree (MCT) M of e on D is the minimum subtree
of D that contains the nodes $e(k_1), \ldots, e(k_m)$. Tree M is
also called an MCT of query Q on D. The root of M is
the lowest common ancestor (LCA) of $e(k_1), \ldots, e(k_m)$ and
defines one result of Q on D. For instance, one can see that
the article nodes 2 and 11 are results of the example query
$Q_1$ on the example tree $D_1$. In contrast, the article node 6
is not a result of $Q_1$.

We use the concept of LCA size to rank the results in a query
answer. Similarly to metrics for flat documents in information
retrieval, LCA size reflects the proximity of keyword
instances in the data tree. The size of an MCT is the number
of its edges. Multiple MCTs of Q on D with different
sizes might be rooted at the same LCA node l. The size of
l (denoted size(l)) is the minimum size of the MCTs rooted
at l.

For instance, the size of the result article node 2 of query
$Q_1$ on the data tree $D_1$ is 3 while that of the result article
node 11 is 6 (note that there are multiple MCTs of different
sizes rooted at node 11 in $D_1$).
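The notions of MCT size and LCA size can be sketched in a few lines of Python. The sketch assumes nodes are Dewey tuples and that every embedding is given as the list of nodes it maps to; the names mct and answer are illustrative, and the codes in the usage note are invented rather than taken from Figure 1. An edge is identified by the Dewey code of its child endpoint, so the MCT size is the number of distinct prefixes strictly below the root.

```python
from typing import Dict, List, Sequence, Tuple

Dewey = Tuple[int, ...]


def lca(nodes: Sequence[Dewey]) -> Dewey:
    """Longest common prefix of the given Dewey codes."""
    prefix = nodes[0]
    for node in nodes[1:]:
        n = 0
        while n < min(len(prefix), len(node)) and prefix[n] == node[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def mct(nodes: Sequence[Dewey]) -> Tuple[Dewey, int]:
    """Return the root and the size (number of edges) of the MCT of the nodes."""
    root = lca(nodes)
    edges = {
        node[:depth]
        for node in nodes
        for depth in range(len(root) + 1, len(node) + 1)
    }
    return root, len(edges)


def answer(embeddings: Sequence[Sequence[Dewey]]) -> List[Tuple[Dewey, int]]:
    """Keep the minimum MCT size per LCA and list LCAs by increasing size."""
    best: Dict[Dewey, int] = {}
    for nodes in embeddings:
        root, size = mct(nodes)
        if root not in best or size < best[root]:
            best[root] = size
    return sorted(best.items(), key=lambda item: (item[1], item[0]))
```

For instance, if two embeddings are rooted at the same node with MCT sizes 6 and 5, that node is reported once with size 5, and it is listed after any result of size 3.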

Definition 3. The answer to a cohesive keyword query
Q on a data tree D is a list $[l_1, \ldots, l_n]$ of the LCAs of Q on
D such that $size(l_i) \leq size(l_j)$, $i < j$.

For instance, article node 2 is ranked above article node 11
in the answer of $Q_1$ on $D_1$.

3. THE ALGORITHM 

We designed algorithm CohesiveLCA for keyword queries
with cohesiveness relationships. Algorithm CohesiveLCA
computes the results of a cohesive keyword query which
satisfy the cohesiveness relationships in the query and are
ranked in ascending order of their LCA size (Definition 3).
The intuition behind CohesiveLCA is that LCAs of keyword
instances in a data tree result from combining LCAs of subsets
of the same instances (i.e., partial LCAs of the query) in a
bottom-up way in the data tree. CohesiveLCA progressively
combines partial LCAs to eventually return full LCAs of
instances of all query keywords higher in the data tree. During
this process, LCAs are grouped based on the keywords contained
in their subtrees. The members of these groups are compared
with each other in terms of their size. CohesiveLCA exploits a
lattice of partitions of the query keywords.

Algorithm 1: CohesiveLCA

 1  CohesiveLCA(Q: cohesive keyword query, invL: inverted lists)
 2    buildLattice()
 3    while currentNode ← getNextNodeFromInvertedLists() do
 4      currentPLCA ← createPartialLCA(currentNode, 0, null)
 5      push(initStack, currentPLCA)
 6      for every coarsenessLevel do
 7        while partialLCA ← next partial LCA of this coarsenessLevel do
 8          for every stack of coarsenessLevel containing termOf(partialLCA) do
 9            push(stack, partialLCA)
10    emptyStacks() /* pop all entries from stacks sequentially from
                       each coarseness level, letting subsequent levels
                       process newly constructed partial LCAs */

11  push(stack, partialLCA)
12    while stack.dewey not ancestor of partialLCA.node do
13      pop(stack)
14    while stack.dewey ≠ partialLCA.node do
15      addEmptyRow(stack) /* updating stack.dewey until it is equal
                              to partialLCA.node */
16    replaceIfSmallerWith(stack.topRow, partialLCA.term, size)

17  pop(stack)
18    poppedEntry ← stack.pop()
19    if stack.columns = 1 then
20      addResult(stack.dewey, popped[0].size)
      /* Produce new LCAs from two partial LCAs */
21    if stack.columns > 1 then
22      for i ← 0 to stack.columns do
23        for j ← i to stack.columns do
24          if popped[i] and popped[j] contain sizes and
               popped[i].provenance ∩ popped[j].provenance = ∅ then
25            newTerm ← findTerm(popped[i].term, popped[j].term)
26            newSize ← popped[i].size + popped[j].size
27            newProvenance ← popped[i].provenance ∪ popped[j].provenance
28            pLCA ← newPartialLCA(stack.dewey, newTerm, newSize, newProvenance)
      /* Update ancestor (i.e., new top entry) with sizes from popped entry */
29    if stack is not empty and stack.columns > 1 then
30      for i ← 0 to stack.columns do
31        if popped[i].size + 1 < stack.topRow[i].size then
32          stack.topRow[i].size ← popped[i].size + 1
33          stack.topRow[i].provenance ← {lastStep(stack.dewey)}
34    removeLastDeweyStep(stack.dewey)


The lattice of keyword partitions. During the execution
of CohesiveLCA, multiple stacks are used. Every stack
corresponds to a partition of the keyword set of the query.
Each stack entry contains one element (partial LCA) for
every keyword subset belonging to the corresponding partition.
As usual with stack based algorithms for processing
tree structured data, stack entries are pushed and popped
during the computation according to a preorder traversal
of the data tree. Dewey codes are exploited to index stack
entries which at any point during the execution of the algorithm
correspond to a node in the data tree. Consecutive
stack entries correspond to nodes related with parent-child
relationships in the data tree.

The stacks used by algorithm CohesiveLCA are naturally
organized into a lattice, since the partitions of the keyword
set (which correspond to stacks) form a lattice. Coarser
partitions can be produced from finer ones by combining
two of their members. Partitions with the same number
of members belong to the same coarseness level of the lattice.
Figure 2 shows the lattice for the keyword set of the
query (XML Query John Smith). CohesiveLCA combines partial
LCAs following the source to sink paths in the lattice as
shown in Figure 2.


Reducing the dimensionality of the lattice. The lattice
of keyword partitions for a given query consists of all
possible partitions of query keywords. The partitions reflect
all possible ways that query keywords can be combined
to form partial and full LCAs. Cohesiveness relationships
restrict the ways keyword instances can be combined in a
query embedding (Definition 2) to form a query result.
Keyword instances may be combined individually with other
keyword instances to form partial or full LCAs only if they
belong to the same term: if a keyword a is “hidden” from a
keyword b inside a term $T_a$, then an instance of b can only
be combined with an LCA of all the keyword instances of $T_a$
and not individually with an instance of a. These restrictions
result in significantly reducing the size of the lattice of
the keyword partitions as exemplified next.

Figure 3 shows the lattices of the keyword partitions of two
queries. The queries comprise the same set of keywords XML,
Query, John and Smith but involve different cohesiveness
relationships. The lattice of Figure 2 is the full lattice of 15
keyword partitions and allows every possible combination of
instances of the keywords XML, Query, John and Smith. The
query of Figure 3a imposes a cohesiveness relationship on
John and Smith. This modification renders several partitions
of the full lattice of Figure 2 meaningless. For instance,
in Figure 3a the partition [XJ, Q, S] is eliminated, since
an instance of XML cannot be combined with an instance of
John unless the instance of John is already combined with
an instance of Smith, as is the case in the partition [XJS,
Q]. The cohesiveness relationship on John and Smith reduces
the size of the initial lattice from 15 to 7. A second cohesiveness
relationship between XML and Query further reduces the
lattice to the size of 3, as shown in Figure 3b. Note that in
this case, besides partitions that are not permitted because
of the additional cohesiveness relationship (e.g., [XJS, Q]),
some partitions may not be productive, which makes them
useless. [XQ, J, S] is one such partition. The only valid
combination of keyword instances that can be produced from
this partition is [XQ, JS], which is a partition that can be
produced directly from the source partition [X, Q, J, S]
of the lattice. The same holds also for the partition [X, Q,
JS]. Thus, these two partitions can be eliminated from the
lattice.
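The counts in this example can be reproduced with a short Python sketch. It enumerates all set partitions of the keywords and keeps those whose blocks respect every term, under one assumed reading of the restriction: a block that mixes keywords of a term with keywords outside it must contain the whole term. The function names partitions and permitted are hypothetical. The sketch covers only the permission rule; it does not remove the unproductive partitions discussed above, which is why it reports 5 rather than 3 for the query with two terms.

```python
from typing import Iterator, List, Sequence, Set


def partitions(items: Sequence[str]) -> Iterator[List[List[str]]]:
    """Yield every set partition of items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Either the first item forms its own block ...
        yield [[first]] + smaller
        # ... or it joins one of the existing blocks.
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]


def permitted(partition: List[List[str]], terms: Sequence[Set[str]]) -> bool:
    """Assumed rule: a block touching a term lies inside it or contains it."""
    for block in partition:
        b = set(block)
        for term in terms:
            if b & term and not b <= term and not term <= b:
                return False
    return True


keywords = ["X", "Q", "J", "S"]  # XML, Query, John, Smith
full = list(partitions(keywords))
one_term = [p for p in full if permitted(p, [{"J", "S"}])]
two_terms = [p for p in full if permitted(p, [{"J", "S"}, {"X", "Q"}])]
print(len(full), len(one_term), len(two_terms))  # prints: 15 7 5
```

Dropping the two unproductive partitions [XQ, J, S] and [X, Q, JS] from the five permitted ones leaves the three partitions of Figure 3b.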


Algorithm description. Algorithm CohesiveLCA accepts
as input a cohesive keyword query and the inverted
lists of the query keywords and returns all LCAs which satisfy
the cohesiveness relationships of the query, ranked on
their LCA size.

Figure 4: Component and final lattices for the query ((XML Keyword Search) (Paul Cooper) (Mary Davis))


Function buildLattice

 1  buildLattice(Q: query)
 2    singletonTerms ← {keywords(Q)}
 3    stacks.add(createSourceStack(singletonTerms)) /* create the source
                                  stack from all singleton keywords */
 4    constructControlSet(Q) /* produce control sets of combinable terms */
 5    for every control set cset in controlSets with not only singleton keywords do
 6      stacks.add(createSourceStack(cset))
 7    for every s in stacks do
 8      buildComponentLattice(s)

 9  constructControlSet(qp: query subpattern)
10    c ← new Set()
11    for every singleton keyword k in qp do
12      c.add(k)
13    for every subpattern sqp in qp do
14      subpatternTerm ← constructControlSet(sqp)
15      c.add(subpatternTerm)
16    controlSets.add(c)
17    return newTerm(c) /* return the cohesive term consisting of
                           component keywords and terms */

18  buildComponentLattice(s: stack)
19    for every pair t1, t2 of terms in s do
20      newS ← newStack(s, t1, t2) /* construct new stack from s by
                                      combining t1 and t2 columns of s */
21      buildComponentLattice(newS)


The algorithm begins by building the lattice of stacks needed
for the cohesive keyword query processing (line 2). This process
will be explained in detail in the next paragraph. After
the lattice is constructed, an iteration over the inverted
lists (line 3) pushes all keyword instances into the lattice in
Dewey code order starting from the source stack of the lattice,
which is the only stack of coarseness level 0. For every
new instance, a round of sequential processing of all coarseness
levels is initiated (lines 6-9). At each step, entries are
pushed and popped from the stacks of the current coarseness
level. Each stack has multiple columns corresponding to and
named by the keyword subsets of the relevant keyword partition.
Each stack entry comprises a number of elements,
one for every column of the stack. Popped entries contain
partial LCAs that are propagated to the next coarseness levels.
An entry popped from the sink stack (located in the last
coarseness level) contains a full LCA and constitutes a query
result. After finishing the processing of all inverted lists, an
additional pass over all coarseness levels empties the stacks,
producing the last results (line 10).

Procedure push() pushes an entry into a stack after ensuring
that the top stack entry corresponds to the parent of the
partial LCA to be pushed (lines 11-16). This process triggers
pop actions of all entries that do not correspond to ancestors
of the entry to be pushed. Procedure pop() is where
partial and full LCAs are produced (lines 17-34). When an
entry is popped, new LCAs are formed (lines 21-28) and the
parent entry of the popped entry is updated to incorporate
partial LCAs that come from the popped child entry (lines
29-34). The construction of new partial LCAs is performed
by combining LCAs stored in the same entry.


Construction of the lattice. The key feature of CohesiveLCA
is the dimensionality reduction of the lattice which
is induced by the cohesiveness relationships of the input
query. This reduction, as we also show in our experimental
evaluation, has a significant impact on the efficiency of the
algorithm. Algorithm CohesiveLCA does not naively prune
the full lattice to produce a smaller one, but constructs
the lattice needed for the computation from smaller
component sublattices. This is exemplified in Figure 4.


Consider the data tree depicted in Figure 1 and the query
((XML Keyword Search) (Paul Cooper) (Mary Davis))
issued on this data tree. If each term is treated as a unit,
a lattice of the partitions of three items is needed for the
evaluation of the query. This is lattice (iv) of Figure 4a.
However, the input of this lattice consists of combinations
of keywords and not of single keywords. Each of these three
combinations of keywords defines its own lattice, shown on
the left side of Figure 4a (lattices (i), (ii) and (iii)). The
lattice to be finally used by the algorithm CohesiveLCA is
produced by composing lattices (i), (ii) and (iii) with lattice
(iv) and is shown in Figure 4b. This is a lattice of only 9
nodes, whereas the full lattice for 7 keywords has 877 nodes.


Function buildLattice() constructs the lattice for evaluating
a cohesive keyword query. This function calls another
function buildComponentLattice() (line 8). Function
buildComponentLattice() (lines 18-21) is recursive and builds
all lattices for all terms, which may be arbitrarily nested.
The whole process is controlled by the controlSets variable
which stores the keyword subsets admissible by the input
cohesiveness relationships. This variable is constructed by
the procedure constructControlSet() (lines 9-17).
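The recursion of constructControlSet() can be sketched as follows. This is a simplified illustration under stated assumptions: a query is a nested Python list of keywords and sub-terms, a term is represented as a frozenset (so repeated keywords inside one term, which the paper's multisets allow, are collapsed), and the function name construct_control_sets is hypothetical.

```python
from typing import FrozenSet, List, Union

# Assumption: a query pattern is a nested list of keywords (str) and sub-patterns.
Pattern = List[Union[str, list]]


def construct_control_sets(query: Pattern) -> List[FrozenSet]:
    """Collect one control set per term, innermost terms first."""
    control_sets: List[FrozenSet] = []

    def visit(pattern: Pattern) -> FrozenSet:
        members = set()
        for item in pattern:
            if isinstance(item, str):
                members.add(item)         # singleton keyword
            else:
                members.add(visit(item))  # nested term, treated as one unit
        term = frozenset(members)
        control_sets.append(term)
        return term

    visit(query)
    return control_sets


query = [["XML", "Keyword", "Search"], ["Paul", "Cooper"], ["Mary", "Davis"]]
sets = construct_control_sets(query)
print(len(sets))  # prints: 4
```

For this query the four control sets correspond to the three inner terms and to the outer term whose members are those three terms, in line with the four component lattices (i) to (iv) of Figure 4a.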

3.1 Algorithm analysis 

Algorithm CohesiveLCA processes the inverted lists of the
keywords of a query exploiting the cohesiveness relationships
to limit the size of the lattice of stacks used. The size of a
lattice of a set of keywords with k keywords is given by
the Bell number of k, $B_k$, which is defined by the recursive
formula:

$$B_{n+1} = \sum_{i=0}^{n} \binom{n}{i} B_i, \qquad B_0 = B_1 = 1$$
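As a quick check of the lattice sizes quoted in this section, the recurrence for the Bell numbers can be evaluated directly. The sketch below uses only the Python standard library (math.comb, available from Python 3.8); the function name bell_numbers is illustrative.

```python
from math import comb
from typing import List


def bell_numbers(n_max: int) -> List[int]:
    """Return [B_0, ..., B_n_max] using B_{n+1} = sum_i C(n, i) * B_i."""
    bell = [1]  # B_0 = 1
    for n in range(n_max):
        bell.append(sum(comb(n, i) * bell[i] for i in range(n + 1)))
    return bell


print(bell_numbers(7))  # prints: [1, 1, 2, 5, 15, 52, 203, 877]
```

The values $B_4 = 15$ and $B_7 = 877$ match the sizes of the full lattices for 4 and 7 keywords mentioned earlier.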

In a cohesive query containing t terms the number of sublattices
is t + 1, counting also the sublattice of the query (outer
term). The size of the sublattice of a term with cardinality
$c_i$ is $B_{c_i}$. A keyword instance will trigger in the worst case
an update to all the stacks of all the sublattices of the terms
in which the keyword participates. If the maximum nesting
depth of terms in the query is n and the maximum cardinality
of a term or of the query itself is c, then an instance will
trigger $O(n B_c)$ stack updates. For a data tree with depth d,
every processing of a partial LCA by a stack entails in the
worst case d pops and d pushes, i.e., $O(d)$. Every pop from
a stack with c columns implies in the worst case $c(c-1)/2$
combinations to produce partial LCAs and c size updates to
the parent node, i.e., $O(c^2)$. Thus, the time complexity of
CohesiveLCA is given by the formula:

$$O\left(d\, n\, c^2\, B_c \sum_{i=1}^{k} |S_i|\right)$$

where $S_i$ is the inverted list of the keyword i. The maximum
term cardinality for a query with a given number of keywords
depends on the number of query terms. It is achieved by
the query when all the terms contain one keyword and one
term, with the exception of the innermost nested term which
contains two keywords. Therefore, the maximum term cardinality
is $k - t - 1$ and the maximum nesting depth is t.
Thus, the complexity of CohesiveLCA is:

$$O\left(d\, t\, (k-t-1)^2\, B_{k-t-1} \sum_{i=1}^{k} |S_i|\right)$$

This is a parameterized complexity which is linear in the size
of the input (i.e., $\sum |S_i|$) for a constant number of keywords
and terms.

4. EXPERIMENTAL EVALUATION 

We experimentally studied the effectiveness of the CohesiveLCA
semantics and the efficiency of the CohesiveLCA
algorithm.

The experiments were conducted on a computer with a 1.8 GHz
dual core Intel Core i5 processor running Mac OS Lion. The
code was implemented in Java.


We used the real datasets DBLP¹, NASA² and PSD³ and
the benchmark auction dataset XMark⁴. These datasets
display various characteristics. Table 1 shows their statistics.
DBLP is the largest and XMark the deepest dataset.



| | DBLP | XMark | NASA | PSD |
|---|---|---|---|---|
| size | 1.15 GB | 116.5 MB | 25.1 MB | 683 MB |
| maximum depth | 5 | 11 | 7 | 6 |
| # nodes | 34,141,216 | 2,048,193 | 530,528 | 22,596,465 |
| # keywords | 3,403,570 | 140,425 | 69,481 | 2,886,921 |
| # distinct labels | 44 | 77 | 68 | 70 |
| # distinct label paths | 196 | 548 | 110 | 97 |


Table 1: DBLP, XMark, NASA and PSD datasets’ statistics 

For the efficiency evaluation, we used the DBLP, XMark 
and NASA datasets in order to test our algorithm on data 
with different structural and size characteristics. For the 
effectiveness experiments, we used the largest real datasets
DBLP and PSD. The keyword inverted lists of the parsed 
datasets were stored in a MySQL database. 

4.1 Efficiency of the CohesiveLCA algorithm 

In order to study the efficiency of our algorithm we used
collections of queries with 10, 15 and 20 keywords issued
against the DBLP, XMark and NASA datasets. For each
query size, we formed 10 cohesive query patterns. Each
pattern involves a different number of terms of different
cardinalities nested in various depths. For instance, a query
pattern for a 10-keyword query is (xx((xxxx) (xxxx))). We
used these patterns to generate keyword queries on the three
datasets. The keywords were chosen randomly. In order to
stress our algorithm, they were selected among the most
frequent ones. In particular, for each pattern we generated 10
different keyword queries and we calculated the average of
their evaluation time. In total, we generated 100 queries for
each query size and dataset. For each query, we ran
experiments scaling the size of each keyword inverted list from
100 to 1000 instances with a step of 100 instances.
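The structural parameters of such a pattern can be read off
mechanically. The sketch below parses a pattern string and reports
its number of keywords, number of nested terms, maximum term
cardinality and nesting depth. It is a hypothetical helper, not part
of the experimental code; it assumes that each `x` stands for one
keyword, that the cardinality of a term is the number of keywords and
terms it directly contains, and that nesting depth counts the levels
of terms below the outer query.

```python
def parse_pattern(pattern: str) -> list:
    """Parse e.g. "(xx((xxxx) (xxxx)))" into nested lists of "x" items."""
    stack = [[]]  # stack[0] is a dummy root that receives the outer query
    for ch in pattern:
        if ch == "(":
            stack.append([])
        elif ch == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')'")
            term = stack.pop()
            stack[-1].append(term)
        elif ch == "x":
            stack[-1].append("x")
        elif not ch.isspace():
            raise ValueError(f"unexpected character {ch!r}")
    root = stack[0]
    if len(stack) != 1 or len(root) != 1 or not isinstance(root[0], list):
        raise ValueError("pattern must be one parenthesised query")
    return root[0]


def pattern_stats(term: list) -> dict:
    """Collect keyword count, nested term count, max cardinality, depth."""
    stats = {
        "keywords": 0,
        "terms": 0,  # nested terms, excluding the outer query itself
        "max_cardinality": len(term),
        "nesting_depth": 0,
    }
    for element in term:
        if isinstance(element, list):
            sub = pattern_stats(element)
            stats["keywords"] += sub["keywords"]
            stats["terms"] += 1 + sub["terms"]
            stats["max_cardinality"] = max(
                stats["max_cardinality"], sub["max_cardinality"]
            )
            stats["nesting_depth"] = max(
                stats["nesting_depth"], 1 + sub["nesting_depth"]
            )
        else:
            stats["keywords"] += 1
    return stats


print(pattern_stats(parse_pattern("(xx((xxxx) (xxxx)))")))
```

For the 10-keyword pattern quoted above, this reports 10 keywords,
3 nested terms, a maximum term cardinality of 4 and a nesting depth
of 2 under the stated conventions.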

Performance scalability on dataset size. Figure 5 shows
how the computation time of CohesiveLCA scales when the
total size of the query keyword inverted lists grows. Each
plot corresponds to a different query size (10, 15 or 20
keywords) and displays the performance of CohesiveLCA on
the three datasets. Each curve corresponds to a different
dataset and each point in a curve represents the average
computation time of the 100 queries that conform to the 10
different patterns of the corresponding query size. Since the
keywords are randomly selected among the most frequent
ones, the total size of the inverted lists reflects the size of the
dataset.

All plots clearly demonstrate that the computation time of
CohesiveLCA is linear in the dataset size. This pattern is
followed, in fact, by each one of the 100 contributing queries.
In all cases, the evaluation times on the different datasets
confirm the dependence of the algorithm’s complexity on the
maximum depth of the dataset: the evaluation on DBLP
(max depth 5) is always faster than on NASA (max depth
7), which in turn is faster than on XMark (max depth 11).

1 http://www.informatik.uni-trier.de/~ley/db/
2 http://www.cs.washington.edu/research/xmldatasets/www/repository.html
3 http://pir.georgetown.edu/
4 http://www.xml-benchmark.org

Figure 5: Performance of CohesiveLCA for queries with 20 keywords varying the number of instances

It is interesting to note that our algorithm achieves
interactive computation times even for queries with many
keywords and on large and complex datasets. For instance, a
query with 20 keywords and 20,000 instances needs only 20 sec
to be computed on the XMark dataset. These results are
achieved on a prototype without the optimizations of a
commercial keyword search system. To the best of our
knowledge, there is no other experimental study in the relevant
literature that considers queries of such sizes.


[Legend: 10 keywords, 15 keywords, 20 keywords]



Figure 6: Performance of CohesiveLCA on queries with 6000 
keyword instances for different maximum term cardinalities 
on the DBLP dataset 


Performance scalability on max term cardinality. As
we showed in the analysis of algorithm CohesiveLCA
(Section 3), the key factor which determines the algorithm’s
performance is the maximum term cardinality in the input
query. The maximum term cardinality determines the size
of the largest sublattice contributed by the corresponding
term(s) in the construction of the lattice ultimately used by
the algorithm. This property is confirmed in Figure 6. Each
bar shows the computation time of queries of a query size
with 6000 keyword instances on the DBLP dataset. The
x axis shows the maximum term cardinality of the queries.
The computation time shown on the left y axis is averaged
over all the queries of a query size with the specific
maximum cardinality. The curve shows the evolution of the size
of the largest sublattice as the maximum term cardinality
increases. The size of the sublattice is indicated by the
number of the stacks it contains (right y axis).


Notice the importance of the maximum cardinality. It is
interesting to observe that the computation time depends
primarily on the maximum term cardinality and to a much
lesser extent on the total number of keywords. For instance,
a query of 20 keywords with a term with maximum
cardinality 6 is computed much faster than a query with 10
keywords with maximum cardinality 7. This observation shows
that as long as the terms involved are not huge, CohesiveLCA
is able to efficiently compute queries with a very large number
of keywords.
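The Bell-number factor of the complexity analysis gives a rough
sense of why this happens. As a back-of-the-envelope check (these
are worst-case factors, not predicted running times), compare the
cardinality-dependent factor $c^2 B_c$ for maximum cardinalities 6
and 7:

```latex
\begin{align*}
B_6 &= 203, & 6^2 \cdot B_6 &= 36 \cdot 203 = 7308, \\
B_7 &= 877, & 7^2 \cdot B_7 &= 49 \cdot 877 = 42973.
\end{align*}
```

The factor grows by roughly 5.9 times when the maximum cardinality
goes from 6 to 7, whereas the nesting depth and the total size of the
inverted lists enter the bound only linearly.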


4.2 Effectiveness of CohesiveLCA semantics 


For our effectiveness experiments we used the real datasets
DBLP and PSD. Table 2 lists the queries we evaluated on
each dataset. The relevance of the results to the queries
was provided by five expert users. In order to cope with the
large number of query results, we showed the users the
tree patterns of the query results, which are far fewer than
the total number of results, and they selected the relevant
ones from these. We compared the CohesiveLCA semantics with
the SLCA [15, 32, 30, 8] and the ELCA [13, 33, 34] filtering
semantics. Similarly to our approach, these semantics
ignore the node labels and filter out irrelevant results based
on the structural characteristics of the results.
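For reference, the SLCA filter keeps an LCA only if no other LCA of
the same keyword set is its descendant. The following sketch
illustrates that rule on a toy tree. It is a brute-force
illustration under stated assumptions, not one of the cited
algorithms: nodes are assumed to be identified by Dewey-style tuples
(a node's code extends its parent's code), the function names are
hypothetical, and the inverted lists are made-up data.

```python
from itertools import product


def lca(nodes):
    """Lowest common ancestor = longest common prefix of Dewey codes."""
    prefix = []
    for parts in zip(*nodes):  # zip stops at the shortest code
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def all_lcas(inverted_lists):
    """LCA of every combination with one instance per keyword."""
    return {lca(combo) for combo in product(*inverted_lists)}


def is_proper_ancestor(a, b):
    """True if node a lies strictly above node b in the tree."""
    return len(a) < len(b) and b[: len(a)] == a


def slcas(lcas):
    """Keep only LCAs that have no descendant LCA."""
    return {
        a for a in lcas
        if not any(is_proper_ancestor(a, b) for b in lcas)
    }


# Hypothetical inverted lists for two keywords in a small tree.
k1 = [(0, 0, 0), (0, 1, 0)]
k2 = [(0, 0, 1), (0, 1, 1, 0)]
candidates = all_lcas([k1, k2])
print(sorted(candidates), sorted(slcas(candidates)))
```

In this toy example the root `(0,)` is an LCA but is discarded by the
SLCA filter because the LCAs `(0, 0)` and `(0, 1)` lie below it.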


Table 3 shows the number of results for each query and
approach on the DBLP and PSD datasets. The CohesiveLCA
approach returns all the results that satisfy the cohesiveness
relationships in the query. Since these relationships are
imposed by the user, all other results returned by the other
approaches are irrelevant. For instance, for query Q5 on PSD,
only 3 results satisfy the cohesiveness relationships of the
user, and SLCA and ELCA return at least 37 and at least 40
irrelevant results, respectively. CohesiveLCA further ranks the
results on LCA size, allowing the user to select results of
top-k sizes. Usually, the results of the top size are sufficient,
as we show below.

Figure 7: Precision and E-measure of top size CohesiveLCA, SLCA and ELCA filtering semantics. Panels: (a) Precision %, (b) E-measure %

| dataset | query | cohesive keyword query |
|---|---|---|
| DBLP | Q1 | (proof (scott theorem)) |
| | Q2 | ((ieee transactions communications) (wireless networks)) |
| | Q3 | ((Lei Chen) (Yi Guo)) |
| | Q4 | ((wei wang) (yi chen)) |
| | Q5 | ((vldb journal) (spatial databases)) |
| PSD | Q1 | ((african snail) mRNA) |
| | Q2 | ((alpha 1) (isoform 3)) |
| | Q3 | ((penton protein) (human adenovirus 5)) |
| | Q4 | ((B cell stimulating factor) (house mouse)) |
| | Q5 | ((spectrin gene) (alpha 1)) |

Table 2: DBLP and PSD queries for effectiveness experiments


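| dataset | query | # of results: cohesive | # of results: SLCA | # of results: ELCA |
|---|---|---|---|---|
| DBLP | Q1 | 2 | 3 | 4 |
| | Q2 | 527 | 981 | 982 |
| | Q3 | 2 | 3 | 4 |
| | Q4 | 11 | 60 | 61 |
| | Q5 | 5 | 8 | 9 |
| PSD | Q1 | 3 | 2 | 3 |
| | Q2 | 14 | 78 | 79 |
| | Q3 | 2 | 4 | 4 |
| | Q4 | 4 | 7 | 8 |
| | Q5 | 3 | 40 | 43 |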


Table 3: Number of results of queries on DBLP and PSD datasets 

In order to compare CohesiveLCA with SLCA and ELCA,
which provide filtering semantics, we select and return to the
user only the top size results. These are the results with the
minimum LCA size. The comparison is based on the widely
used precision (P), recall (R) and
E-measure $= \frac{2PR}{P+R}$ metrics [3]. Figure 7 depicts
the results for the three semantics on the two datasets. Since
all approaches demonstrate high recall, we only show the
precision and E-measure results in the interest of space.
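The evaluation procedure just described can be sketched as follows.
This is an illustrative helper, not the code used in the
experiments: the function names and the sample data are
hypothetical, and the combined measure is assumed to be the harmonic
mean of precision and recall, so that 100% is the best value, as in
Figure 7.

```python
def top_size(result_sizes: dict) -> set:
    """Return the ids of the results whose LCA size is minimum."""
    if not result_sizes:
        return set()
    best = min(result_sizes.values())
    return {rid for rid, size in result_sizes.items() if size == best}


def precision_recall_hmean(returned: set, relevant: set) -> tuple:
    """Precision, recall and their harmonic mean 2PR / (P + R)."""
    hits = len(returned & relevant)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    h = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, h


# Hypothetical query: three relevant results, one of them not of
# minimum LCA size and therefore missed by the top size selection.
sizes = {"r1": 3, "r2": 3, "r3": 5}
relevant = {"r1", "r2", "r3"}
returned = top_size(sizes)
print(returned, precision_recall_hmean(returned, relevant))
```

With this made-up input, precision stays perfect while recall and the
combined measure drop, which is the kind of situation in which
relevant results of non-minimum size are missed by the top size
selection.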

The diagram of Figure 7a shows that CohesiveLCA largely
outperforms the other two approaches in all cases.
CohesiveLCA shows perfect precision for all queries on both
datasets. It also shows perfect E-measure on the DBLP
dataset. Its E-measure on the PSD dataset is lower, for the
following reason: contrary to the shallow DBLP
dataset, the PSD dataset is deep and complex and produces
results of various sizes for most of the queries. Some of the
relevant results are not of minimum size and they are missed
by top size CohesiveLCA. Nevertheless, all the recall
measurements of CohesiveLCA exceed those of the other two
approaches. One query in particular highlights the inherent
weakness of the SLCA and ELCA semantics. Not only do they
return irrelevant results which do not satisfy cohesiveness
relationships (low precision) but they also miss relevant LCAs
which satisfy cohesiveness relationships (low recall) because
they are ancestors of other relevant LCAs. In the case of Q5 on
PSD, the SLCA semantics fails to return any correct result,
although it returns 40 results in total (Table 3), displaying
0% precision and 0% recall.

5. RELATED WORK 

Keyword queries offer the user the ease of freely
forming queries by using only keywords. Approaches that
evaluate keyword queries are currently very popular,
especially on the web, where numerous sources contribute data
often with unknown structure and where end users with no
specialized skills need to find useful information. However,
the imprecision of keyword queries often results in low
precision and/or recall of the search systems. Some approaches
(a) combine structural constraints [9] with keyword search,
while others (b) try to infer useful structural information
implied by simple keyword queries [4, 31] by exploiting
statistical information of the queries and the underlying
datasets. These approaches require a minimum knowledge
of the dataset or a heavy dataset preprocessing in order to be
able to accurately assess candidate keyword query results.


The task of locating the nodes in a data tree which most
likely match a keyword query has been extensively studied
in [29, 9, 15, 14, 17, 23, 33, 16, 19, 8, 10, 5, 20, 22, 31, 27,
1, 11]. All these approaches use LCAs of keyword instances
as a means to define query answers. The smallest
LCA (SLCA) semantics [32, 24] validates LCAs that do not
contain other descendant LCAs of the same keyword set. A
relaxation of this restriction is introduced by the exclusive
LCA (ELCA) semantics [13, 33], which also accepts LCAs that
are ancestors of other LCAs, provided that they refer to a
different set of keyword instances.

In a slightly different direction, semantic approaches account
also for node labels and node correlations in the data tree.
Valuable LCAs (VLCAs) [9, 17] and meaningful LCAs
(MLCAs) aim at “guessing” the user intent by exploiting the
labels that appear in the paths of the subtree rooted at an
LCA. All these semantics are restrictive and, depending on
the case, they may demonstrate low recall rates, as shown in
[31].

The efficiency of algorithms that compute LCAs as answers
to keyword queries depends on the query semantics adopted.
By design, they exploit the adopted filtering semantics to
prune irrelevant LCAs early on in the computation.
Stack-based algorithms lend themselves naturally to processing
tree data. In [13], a stack-based algorithm that processes
inverted lists of query keywords and returns ranked ELCAs was
presented. This approach also ranks the query results based on
precomputed tree node scores inspired by PageRank [6] and
IR-style keyword proximity in the subtrees of the ranked
ELCAs. In [32], two efficient algorithms for computing SLCAs
are introduced, exploiting special structural properties of
SLCAs. This approach also introduces an extension of the basic
algorithm, so that it returns all LCAs by augmenting the set
of already computed SLCAs. Another algorithm for
efficiently computing SLCAs for both AND and OR keyword
query semantics is developed in [30]. The Indexed Stack [33]
and the Hash Count [34] algorithms improve the efficiency
of [13] in computing ELCAs. Finally, [4, 5] elaborate on
sophisticated ranking of candidate LCAs, aiming primarily at
effective keyword query answering.

Filtering semantics are often combined with (i) structural
and semantic correlations [13, 18, 8, 31, 9, 1], (ii) statistical
measures [9, 8, 19], and (iii) probabilistic models [27] to
rank the result set. Nevertheless, such approaches require
expensive preprocessing of the dataset, which makes them
impractical in the cases of fast evolving data and streaming
applications.

The concept of LCA size for ranking keyword query results
was initially introduced in [10, 11]. An algorithm that
computes all the LCAs ranked on LCA size and exploits a
lattice of stacks is also presented in these papers. However,
this algorithm is designed for flat keyword queries and cannot
handle cohesive queries, which is the focus of the present
work.

















6. CONCLUSION 

Current approaches for assigning semantics to keyword queries
on tree data cannot cope efficiently or effectively with the
large number of candidate results and produce answers of
low quality. This poor performance cannot offset the
convenience and simplicity offered to the user by the keyword
queries. In this paper, we claim that search systems
cannot guess the user intent from the query and the
characteristics of the data to produce high quality answers on
any type of dataset. We therefore introduce a cohesive keyword
query language which allows the users to naturally and
effortlessly express cohesiveness relationships on the query
keywords. We design an algorithm which builds a lattice of
stacks to efficiently compute cohesive keyword queries and
leverages cohesiveness relationships to reduce its
dimensionality. Our theoretical analysis and experimental
evaluation show that it outperforms previous approaches in
producing answers of high quality and scales smoothly,
efficiently evaluating queries with a very large number of
frequent keywords on large and complex datasets where
previous algorithms for flat keyword queries fail.

We are currently working on alternative ways of defining
semantics for cohesive keyword queries on tree data and, in
particular, on defining skyline semantics which consider all
the cohesive terms of a query in order to rank the query
results.

7. REFERENCES 

[1] C. Aksoy, A. Dimitriou, D. Theodoratos, and X. Wu. 
XReason: A Semantic Approach that Reasons with 
Patterns to Answer XML Keyword Queries. In 
DASFAA, pages 299-314, 2013. 

[2] S. Amer-Yahia and M. Lalmas. XML Search:
Languages, INEX and Scoring. SIGMOD Record,
35(4):16-23, 2006.

[3] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern
Information Retrieval - the concepts and technology 
behind search, Second edition. Pearson Education Ltd., 
Harlow, England, 2011. 

[4] Z. Bao, T. W. Ling, B. Chen, and J. Lu. Effective 
XML Keyword Search with Relevance Oriented 
Ranking. In ICDE, pages 517-528, 2009. 

[5] Z. Bao, J. Lu, T. W. Ling, and B. Chen. Towards an 
Effective XML Keyword Search. IEEE Trans. Knowl. 
Data Eng., 22(8):1077-1092, 2010. 

[6] S. Brin and L. Page. The Anatomy of a Large-Scale 
Hypertextual Web Search Engine. Computer 
Networks, 30(1-7):107-117, 1998.

[7] O. C. L. Center. Dewey Decimal Classification, 2006. 

[8] L. J. Chen and Y. Papakonstantinou. Supporting 
top-K Keyword Search in XML Databases. In ICDE, 
pages 689-700, 2010. 

[9] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv.
XSEarch: A Semantic Search Engine for XML. In
VLDB, pages 45-56, 2003.

[10] A. Dimitriou and D. Theodoratos. Efficient keyword 
search on large tree structured datasets. In KEYS, 
pages 63-74, 2012. 

[11] A. Dimitriou, D. Theodoratos, and T. Sellis.
Top-k-size keyword search on tree structured data.
Inf. Syst., 47:178-193, 2015.


[12] R. Goldman and J. Widom. DataGuides: Enabling 
Query Formulation and Optimization in 
Semistructured Databases. In VLDB, pages 436-445, 
1997. 

[13] L. Guo, F. Shao, C. Botev, and
J. Shanmugasundaram. XRANK: Ranked Keyword
Search over XML Documents. In SIGMOD
Conference, pages 16-27, 2003.

[14] V. Hristidis, N. Koudas, Y. Papakonstantinou, and 
D. Srivastava. Keyword Proximity Search in XML 
Trees. IEEE Trans. Knowl. Data Eng., 18(4):525-539, 
2006. 

[15] V. Hristidis, Y. Papakonstantinou, and A. Balmin. 
Keyword proximity search on xml graphs. In ICDE, 
pages 367-378, 2003. 

[16] L. Kong, R. Gilleron, and A. Lemay. Retrieving 
meaningful relaxed tightest fragments for XML 
keyword search. In EDBT, pages 815-826, 2009. 

[17] G. Li, J. Feng, J. Wang, and L. Zhou. Effective 
Keyword Search for Valuable LCAs over XML 
documents. In CIKM, pages 31-40, 2007. 

[18] G. Li, C. Li, J. Feng, and L. Zhou. SAIL:
Structure-aware Indexing for Effective and Progressive
top-k Keyword Search over XML Documents. Inf.
Sci., 179(21):3745-3762, 2009.

[19] J. Li, C. Liu, R. Zhou, and W. Wang. Suggestion of 
Promising Result Types for XML Keyword Search. In 
EDBT, pages 561-572, 2010. 

[20] J. Li, C. Liu, R. Zhou, and W. Wang. Top-k Keyword 
Search over Probabilistic XML Data. In ICDE, pages 
673-684, 2011. 

[21] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free 
XQuery. In VLDB, pages 72-83, 2004. 

[22] X. Liu, C. Wan, and L. Chen. Returning Clustered
Results for Keyword Search on XML Documents.
IEEE Trans. Knowl. Data Eng., 23(12):1811-1825,
2011.

[23] Z. Liu and Y. Chen. Identifying meaningful return 
information for XML keyword search. In SIGMOD 
Conference, pages 329-340, 2007. 

[24] Z. Liu and Y. Chen. Reasoning and Identifying 
Relevant Matches for XML Keyword Search. PVLDB, 
1(1):921-932, 2008.

[25] Z. Liu and Y. Chen. Processing Keyword Search on 
XML: a Survey. World Wide Web, 14(5-6):671-707, 
2011.

[26] B. Mozafari, K. Zeng, L. D’Antoni, and C. Zaniolo. 
High-performance complex event processing over 
hierarchical data. ACM Trans. Database Syst., 
38(4):21, 2013. 

[27] K. Nguyen and J. Cao. Top-k Answers for XML 
Keyword Queries. World Wide Web, 15(5-6):485-515, 
2012.

[28] P. Ogden, D. B. Thomas, and P. Pietzuch. Scalable 
XML query processing using parallel pushdown 
transducers. PVLDB, 6(14):1738-1749, 2013. 

[29] A. Schmidt, M. L. Kersten, and M. Windhouwer. 
Querying XML Documents Made Easy: Nearest 
Concept Queries. In ICDE, pages 321-329, 2001. 

[30] C. Sun, C. Y. Chan, and A. K. Goenka. Multiway 
SLCA-based Keyword Search in XML Data. In 
WWW, pages 1043-1052, 2007. 



[31] A. Termehchy and M. Winslett. Using Structural 
Information in XML Keyword Search Effectively. 
ACM Trans. Database Syst., 36(1):4, 2011. 

[32] Y. Xu and Y. Papakonstantinou. Efficient Keyword 
Search for Smallest LCAs in XML Databases. In 
SIGMOD Conference, pages 527-538, 2005. 


[33] Y. Xu and Y. Papakonstantinou. Efficient LCA based 
keyword search in XML data. In EDBT, pages 
535-546, 2008. 

[34] R. Zhou, C. Liu, and J. Li. Fast ELCA Computation 
for Keyword Queries on XML Data. In EDBT, pages 
549-560, 2010. 



